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Abstract — We develop a general framework for MAP es- 
timation in discrete and Gaussian graphical models using 
Lagrangian relaxation techniques. The key idea is to refor- 
mulate an intractable estimation problem as one defined on a 
more tractable graph, but subject to additional constraints. 
Relaxing these constraints gives a tractable dual problem, 
one defined by a thin graph, which is then optimized by an 
iterative procedure. When this iterative optimization leads to 
a consistent estimate, one which also satisfies the constraints, 
then it corresponds to an optimal MAP estimate of the original 
model. Otherwise there is a "duality gap", and we obtain a 
bound on the optimal solution. Thus, our approach combines 
convex optimization with dynamic programming techniques 
applicable for thin graphs. The popular tree-reweighted max- 
product (TRMP) method may be seen as solving a particular 
class of such relaxations, where the intractable graph is relaxed 
to a set of spanning trees. We also consider relaxations to a 
set of small induced subgraphs, thin subgraphs (e.g. loops), 
and a connected tree obtained by "unwinding" cycles. In 
addition, we propose a new class of multiscale relaxations that 
introduce "summary" variables. The potential benefits of such 
generalizations include: reducing or eliminating the "duality 
gap" in hard problems, reducing the number or Lagrange 
multipliers in the dual problem, and accelerating convergence 
of the iterative optimization procedure. 

I. Introduction 

Graphical models are probability models for a collection of 
random variables on a graph: the nodes of the graph represent 
random variables and the graph structure encodes conditional 
independence relations among the variables. Such models 
provide compact representations of probability distributions, 
and have found many practical applications in physics, sta- 
tistical signal and image processing, error-correcting coding 
and machine learning. However, performing optimal estima- 
tion in such models using standard junction tree approaches 
generally is intractable in large-scale estimation scenarios. 
This motivates the development of variational techniques to 
perform approximate inference, and, in some cases, recover 
the optimal estimate. 

We consider a general Lagrangian relaxation (LR) ap- 
proach to maximum a posteriori (MAP) estimation in graph- 
ical models. The general idea is to reformulate the estimation 
problem on an intractable graph as a constrained estimation 
over an augmented model defined on a larger, but more 
tractable graph. Then, using Lagrange multipliers to relax 
the constraints, we obtain a tractable estimation problem that 
gives an upper-bound on the original problem. This leads to a 
convex optimization problem of minimizing the upper-bound 
as a function of Lagrange multipliers. 
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We consider a variety of strategies to augment the original 
graph. The simplest approach breaks the graph into many 
small, overlapping subgraphs, which involves replicating 
some variables. Similarly, the graph can be broken into a set 
of thin subgraphs, as in the TRMP approach, or "unrolled" 
to obtain a larger, but connected, thin graph. We show 
that all of these approaches are essentially equivalent, being 
characterized by the set of maximal cliques of the augmented 
graph. More generally, we also consider the introduction of 
"summary" variables, which leads naturally to multiscale 
algorithms. We develop a general optimization approach 
based on marginal and max-marginal matching procedures, 
which enforce consistency between replicas of a node or 
edge, and moment-matching in the multiscale relaxation. 
We show that the resulting bound is tight if and only if 
there exists an optimal assignment in the augmented model 
that satisfies the constraints. In that case, we obtain the 
desired MAP estimate of the original model. When there is 
a duality gap, this is evidenced by the occurrence of "ties" 
in the resulting set of max-marginals, which requires further 
augmentation of the model to reduce and ultimately eliminate 
the duality gap. We focus primarily on discrete graphical 
models with binary variables, but also consider the extension 
to Gaussian graphical models. In the Gaussian model, we find 
that, whenever LR is "well-posed", so that the augmented 
model is valid, it leads to a tight bound and the optimal 
MAP estimate, and also gives upper-bounds on variances 
that provide a measure of confidence in the MAP estimate. 

II. Background 

We consider probabilistic graphical models [1], [2], [3], 
which are probability distributions of the form 



p{xi,...,x„) = ^exp{/(x)} 



^exp 



L fcixc) (1) 

I, C&f ) 

where each function fc only depends on a subset of variables 
xc = (-'^v , V £ C) and Z is a normalization constant of the 
model, called the partition function in statistical physics. If 
the sum ranges over all cliques of the graph, which are the 
fully connected subsets of variables, this representation is 
sufficient to realize any Markov model on ?f [1]. However, 
it is also common to consider restricted Markov models 
where only singleton and pairwise interactions are specified. 
In general, we specify the set of interactions by a hypergraph 

C 2^, where 2^ represents the set of all subsets of V. The 
elements of are its hyperedges, which generalizes the usual 
concept of a graph with pairwise edges. 

Discrete Models. While our approach is applicable for 
general discrete models, we focus on models with binary 



variables. One may use either the Boltzmann machine rep- 
resentation Xy e {0,1}, or that of the Ising model x,. G 
{ — 1,+1}. These models can be represented as in ([T]) with 

f{x-e) = £ 0£fe(x£), <I,e{xe) = Y{x, (2) 

£Gf# veE 

This defines an exponential family [4] of probability distri- 
butions based on model features and parameterized by 
9. <t>(0) = logZ(0) is the log-partition function and has 
the moment- generating property: ^^q^J = E9{0£(x)} = Tjg. 
Here, r\ are the moments of the distribution, which serve 
both as an alternate parameterization of the exponential 
family and, in graphical models, to specify the marginal 
distributions on cliques of the model. Inference in discrete 
models using junction tree methods, either to compute the 
mode or the marginals, is generally linear in the number of 
variables n but grows exponentially in the width of the graph 
[2], which is determined by the size of the maximal cliques 
in a junction tree representation of the graph. Hence, exact 
inference is only tractable for thin graphs, that is, where one 
can build an equivalent junction tree with small cliques. 

Gaussian Models. We also consider Gaussian graphical 
models [5], [6] represented in information form: 

p{x) =exp{-^x'^Jx + h'^x-<P{h,J)} (3) 

where J is the information matrix, h a potential vector and 
4>(/j,7) — jj/i^y^'/z — logdety + nlog2;r}. This corresponds 
to the standard form of the Gaussian model specified by the 
covariance matrix P = 7^' and mean vector x = J^^h. This 
translates into an exponential family where we identify {h,J) 
with the parameters 9 and {x,P) with the moments Tj. In 
general, the complexity of inference in Gaussian models is 
ff{n^). The fill pattern of J determines the Markov structure 
of the Gaussian model: {ij) £ if ^ 0. Using more 
efficient recursive inference methods that exploit sparsity, 
such as junction trees or sparse Gaussian elimination, the 
complexity is linear in n but cubic in the width of the graph, 
which is still impractical for many large-scale estimation 
problems. 

III. Discrete Lagrangian Relaxation 

To begin with, consider the problem of maximizing the 
following objective function, defined over a hypergraph C 
2^ based on a vertex set V = {!,..., n} corresponding to 
discrete variables x— {xi,...,x„). 

/(^) - E fsixE) (4) 

For instance, this may be defined as f{x) = (0,0(x)) in 
an exponential family graphical model, such that each term 
corresponds to a feature fE{xE) = 9e^e{xe)- Then, we seek 
X* to maximize f{x) to obtain the MAP estimate of ([T]i. 

An Illustrative Example. To briefly convey the basic con- 
cept, we consider a simple pairwise model defined on a 3- 
node cycle represented in Fig. [T] Here, the augmented 
graph is a 4-node chain, where node 4 is a replica of 
node 1. We copy all the potentials on the nodes and edges 




Fig. 1 . A simple illustrative example of Lagrangian relaxation. 

from to . For the replicated variables, x\ and x'^, we 
spHt /i between f[ and such that fi{y) = f[{y)+f^{y) 
for y e {0, 1}. Now the problem v(\?iXxf{x) is equivalent to 
maximizing f{x') subject to the constraint =x'^. To solve 
the latter we relax the constraint using Lagrange multipliers: 
L(x' ,X) — f' {x') + X (x[ — X4) . The additional term A {x\ — x'^ 
modifies the self-potentials: f[ ^ f[{x\) + Xx\ and /J ^ 
f!x{x'^ ~Xx'^, parameterizing a family of models on all of 
which are equivalent to / under the constraint x'^ = X4. For a 
fixed A, solving m2LXxL{x,X) ^ g{X) gives an upper bound 
on /* =maxv/(x), so by optimizing A to minimize g{X), 
we find the tightest bound g* = m\\\ig{X). If the constraint 
Xj — X4 is satisfied in the final solution, then there is strong 
duality g* — f* and we obtain the correct MAP assignment 
for fix). 

We now discuss the general procedure and develop our 
approach to optimize g{X) in more difficult cases. 

A. Obtaining a Tractable Graph by Vertex Replication 

In this section, we consider approaches that involve repli- 
cating variables to define the augmented model. The basic 
constraints in designing are as follows: is comprised 
of replicas of nodes and edges of Every node and edge 
of ^ must be represented at least once in Finally, 
should be a thin graph, which relates to the complexity of 
our method. 

To help illustrate the various strategies, we consider a 
pairwise model /(x) defined on 5 x 5 grid, as seen in 
Fig. 12 a). A natural approach is to break the model up into 
small subgraphs. The simplest method is to break the graph 
up into its composite interactions. For pairwise models, this 
means that we split the graph into a set of disjoint edges 
as shown in (b). Here, each internal node of the graph is 
replicated four times. To reduce the number of replicated 
nodes, and hence the number of constraints, it is also useful 
to merge many of these smaller subgraphs into larger thin 
graphs. One approach is to group edges into spanning trees 
of the graph as seen in (c). Here, each edge must be including 
in at least one tree, and some edges are replicated in multiple 
trees. The TRMP approach is based on this idea. One could 
also allow multiple replicas of a node in the same connected 
component of For instance, by taking a spanning tree 
of the graph and then adding an extra leaf node for each 
missing edge we obtain the graph seen in (d). 

It is also tractable to use small subgraphs that are not 
trees. We can break the graph into a set of short loops 
as in (e) or a set of induced subgraphs as in (f) where 
we select a set of 3 x 3 subgraphs that overlap on their 




Fig. 2. Illustrations of a variety of possible ways to obtain a tractable 
graph structure from a 5 x 5 grid by replicating some vertices of the graph. 

boundary. In such cases, including additional edges in the 
overlap of these subgraphs, such as the dotted edges in (f), 
can enhance the relaxations that we consider. Finally, we 
reduce the number of constraints in these formulations by 
again grouping subgraphs to form larger subgraphs that are 
still thin, as shown in (g). This will also lead to tractability 
in our methods. Again, it can be useful to include extra 
edges in the overlap of these subgraphs as in (h), although 
this increases the width of the subgraph and affects the 
computational complexity of our methods. 

Notation. Let denote the augmented graph (or collec- 
tion of subgraphs), which is based on an extended vertex set 
y', comprised of replicas of nodes in V . We assume that 
all edges of this graph are also replicas of edges of the 
original graph ^^iJ Thus, there is a well-defined surjective 
map r : ^ each edge E' G is a replica an edge 
E = r{E') E , and every edge of $^ has at least one such 
replica. This notation is overloaded for nodes by treating 
them as singleton edges of We also denote the set-valued 
inverse of F by ^{E) = F^' (£■), which is the set of replicas 
of E, and let r£ = \^M{E)\ denote the number of replicas. 
This defines an equivalence relation on A,B G are 
equivalent A = B if F(A) = F(B), that is, if A,B e ^{E) are 
replicas of the same edge £ G ^. 

B. Equivalent Constrained Estimation Problem 

We now define a corresponding objective function f'{x'), 
where x' — {x[,)y^v' are the variables of the augmented model. 

'in the case that we introduce extra edges in 5^', as in (f) and (h), we 
also add corresponding edges to 'if to maintain this convention. 



For each hyperedge E <E'^^ (including individual nodes), we 
split the function /^(x^) among a set of replica functions 
{fy,E' eM{E)}, requiring that these are consistent, 

fsixE)^ /£/(x£) for all jcfi. (5) 

E'€,ig{E) 

Using the parametric representation f{x) = {9,(p{x)), this 
consistency condition is equivalent to requiring 9e = ^e'^'e'- 
We will see that the LR approach to follow may be viewed as 
an optimization over all such possible consistent splittings. 
Next, we define the augmented objective function over the 
graph as 

/'(^')" E (6) 

This insures that f{x) = f'{x') where x' = ^(x) is the reph- 
cated version of x, defined by x',, ^x,, for all v' G 3?{v). This 
equivalence holds for all consistent configurations x' G C(^)' 
where x' is self-consistent over various replicas of the same 
node. Thus, we are led to an equivalent optimization problem 
in the augmented model subject to consistency constraints: 

r=max/(x)= max /'(x') (7) 

Expressing the consistency constraint as a set of linear 
constraints on the model features 0, we obtain; 

maximize /'(x') 

subject to <I>a{xa) = <pBix'g) for all A = B. 

Recall that, in the discrete binary model, these features are 
defined ^e{xe) = ^vg_eXv. Clearly, there is some redundancy 
in these constraints: x^ = Xh for all replicated nodes a = b 
would insure that the edges agree. However, these redun- 
dant edge-wise feature constraints do enhance the following 
relaxation. 

C. Lagrangian Relaxation 

We have now defined an equivalent model on a tractable 
graph. However, the equivalent constrained optimization is 
still intractable, because the constraints couple some vari- 
ables of spoiling its tractable structure. This suggests 
the use of Lagrangian duality to relax those complicating 
constraints. Introducing Lagrange multipliers Xa^b for each 
constraint, we define the Lagrangian, which is a modified 
version of the objective function: 

L(x',A) = fix') + E h.B (M) - Mx'b)) (9) 

A=B 

Grouping terms by edges E G and using /^(xs) = 
Oe^e{xe), this is represented 

L(x',A)= £ /^(x^;A) 

£Gf/' 

/^(x^;A) = 0^(A)fe(x^) 

^£ (■^ ) = ^£ + "^B^Efi — '^A^Afi (1 0) 



Note that the Lagrange multipHers may be interpreted as 
parameterizing all consistent splittings, 9'{X) spans the sub- 
space of all consistent 9' parameters]! 

It is tractable to maximize the Lagrangian, as it is defined 
over the thin graph The value of this maximization 
defines the dual function: 

g(A) =maxL(ji:',A) (11) 

Note that this is an unconstrained optimization over X', and 
its solution need not lead to a consistent x' G C(^)- However, 
if this x' is consistent then it is an optimal solution of the 
constrained optimization problem and hence x = ^^\x') 
(which is well-defined for consistent x') is also an optimal 
solution of the original problem. This is the goal of our 
approach, to find tractable relaxations of the MAP estimation 
problem which lead to the correct MAP estimate. This 
motivates solution of the dual problem: 

rmng{X)^g* (12) 

A 

Appealing to well-known results [7], [8], we conclude: 

Proposition 1 (Lagrangian duality): We have g{X) > f* 
for all X. Hence g* > f*. If g{X*) = g*, then one of the 
following holds: 

(i) There exists a consistent solution: 

x' G argmaxL(x';A*) n aX). 

Then, we have strong duality g* = /* and the set of 
all MAP estimates is obtained as: 

arg max /'fx') = argmaxLfx', A*) n C(X). 

.v'ef(x)' x'ex 

(ii) There are no consistent solutions: 

argmaxL(x';A*)nC(X) =0. 

a-'gX' 

Then, there is a duality gap g* > /* and no choice of 

A will provide a consistent solution. 
Also, condition (i) holds only if g{X*) = g*. 

This result generalizes the analogous strong tree- 
agreement optimality condition for TRMP, and clarifies 
its connection to standard Lagrangian duality results for 
integer programs. To provide some intuition, we present 
the following geometric interpretation illustrated in Fig. [3] 
The dual function is the maximum over a finite set of 
linear functions in A indexed by x'. For each x' G X', 
there is a linear function g{X;x') — {a{x'),X) +b{x'), with 
fl(x') = {<Pa{x') — ^b{x'))a=b, which is the gradient, and 
b{x') —f'(x'). The graph of each of these functions defines 
a hyperplane in R''+', where d is the number of constraints. 
The flat hyperplanes, with a — Q, correspond to consistent 
assignments x' G C(^)- The remaining sloped hyperplanes 
represent inconsistent assignments. Hence, the highest flat 
hyperplane corresponds to the optimal MAP estimate, with 
height equal to /*. The dual function ^(A) is defined by the 

^We obtain a minimal A parameterization by only using a subset of 
constraints in (9j, such that {(<j>^(x') — 0g(y))} are linearly independent. 
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Fig. 3. Illustration of the Lagrangian duality in the cases that (a) there is 
a duality gap and (b) there is no duahty gap (strong duality holds). 

maximum height over this set of hyperplanes for each A, 
and is therefore convex, piece-wise linear and greater than 
or equal to /* for all A. In the case of a duality gap, the 
inconsistent hyperplanes hide the consistent ones, as depicted 
in (a), so that the minimum of the dual function is defined 
by an intersection of slanted hyperplanes corresponding to 
inconsistent assignments of x'. If there is no duality gap, 
as depicted in (b), then the minimum is defined by the 
flat hyperplane corresponding to a consistent assignment. Its 
intersection with slanted hyperplanes defines the polytope of 
optimal Lagrange multipliers over which the maximum flat 
hyperplane is exposed. 

D. Linear Programming Formulations 

We briefly consider a connection between this LR pic- 
ture and TRMP [9], [10] and related linear programming 
approaches [11], [12], [13]. This analysis also serves to 
understand when different relaxations of the MAP estimation 
problem will be equivalent. 

The epigraph of the dual function is defined as the set 
of all points (A,/i) = M''+' where g{X) < h, that is, where 
a{x')X -+b{x') < h for all x' . Thus, the minimum of the dual 
function is equal to the lowest point of the epigraph, which 
defines a linear program (LP) over (A,/i) G M'^+': 

minimize h 

subject to (fl(x'),A) < h for afl x' . 

Note that there are exponentially many constraints in this 
formulation, so it is intractable. However, recalling that it is 
tractable to compute the dual function for a given A, using 
the max-product algorithm applied to the thin graph , we 
seek a more tractable representation of this LP. To achieve 
this, we consider the LP dual problem obtained by dualizing 



the constraints, which is always tight [8]. This LP dual should 
be distinguished from our Lagrangian dual (fT2] i that is the 
subject of our paper 

Introducing non-negative Lagrange multipliers > 

for each inequality constraint, indexed by x' G X', we obtain 
the LP Lagrangian: 

M{h,X;iJ.) = h + ii[{a{x'),X)+b{x')-h] 

= + (14) 

where jJ. denotes /i -weighted summation, e.g., /x[fl] — 
Y.x' l-L(x')a{x'). The LP dual function is then: 



M*(Ai): 



:minM(/z,A;/i) : 



IJ.[b], = 1 and ii[a] 

— oo otherwise. 



(15) 

Note that ^ > and = 1 imply that ;U is a probabil- 
ity distribution and an expectation operator Recalling 
alx') = {<j)A{x')-Mx'),A=B) andb{x')=f'{x'), we obtain 
the dual LP: 



maxM*(Ai) = 



maximize jj, [/'] 

subject to li[(pA]= [(Pb] for A=B 



(16) 

We seek a probability distribution over all configurations 
of the augmented model that maximizes the expected value 
of f'{x') subject to constraints that the moments specifying 
marginal distributions are consistent for replicated nodes 
and edges of the graph. This is a convex relaxation of the 
constrained version of problem (4), where the objective and 
constraint functions have been replaced by their expected 
values under jx. Note that only marginals /x^/ over hyper- 
edges £■ G are needed to evaluate both the objective and 
the constraints of this LP. Hence, it reduces to one defined 
over the marginal polytope ./# (^^') [9], defined as the set of 
all realizable collections of marginals over the hyperedges of 

. Moreover, if the graph is chordal [2], then its marginal 
polytope has a simple characterization. Let ^iocai(^^') denote 
the local marginal polytope defined as the set of all edge- 
wise marginal specifications that are consistent on intersec- 
tions of edges. In general, .^(^^') C ^iocai(^')- However, 
in chordal graphs it holds that J^{^') = .^iocai(^^')- Thus, 
if is a thin chordal graph, we obtain a tractable LP whose 
value is equivalent to g* in our frameworkU 

One last step shows the connection to LP approaches [9], 
[11], [12]. The key observation is that, roughly speaking, 

^local(^') n = = B} = .^local(^)- (17) 

This is seen by replicating marginals from to , or by 
copying (consistent) replicated marginals back to . For such 
consistent \l, we have \l\f\ = \l\f\, which gives: 

g* ^ max ^[/] > max \i\f\=f. (18) 



The maximum over ^iocai(^) gives an upper-bound on the 
maximum over ^i^) C ^iocai(^)- The latter is equivalent 

^ Some graphs shown in Fig. |2] are not chordal, but they can be extended 
to a thin chordal graph by adding a few edges. If no two of these new edges 
are equivalent when mapped into 'S , then this does not change g'. 
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Fig. 4. Illustration of the "log-sum-exp" smooth approximation of the dual 
function, as a function of "temperature" T, and of an optimization procedure 
for minimizing the non-smooth dual function through a sequence of smooth 
minimizations. 

to exact MAP estimation and the bound becomes tight if 
is the set of maximal cliques of a chordal graph. This 
discussion leads to the following characterization of LR: 

Proposition 2 (LR Hierarchy): Equivalence: Let ^^j' and 
'^2 be the set of maximal cliques of two chordal augmented 
graphs. If r^' (^^i) = r^' (^^2) then g\ = g*2 for the respective 
dual problems. Let g*{'i^) denote the common dual value 
of all such chordal relaxations where r^'(^^') = Mono- 
tonicity: If % C ^2 then g*{'S\) > g*{'^2)- Strong Duality: 
If is the set of maximal cliques of a chordal graph, then 

g*m=r. 

E. Smooth Relaxation of the Dual Problem 

In this section, we develop an approach to solve the 
dual problem. One approach to minimize g{X) is to use 
non-smooth optimization methods, such as the subgradient 
method [8]. Here, we consider an alternative, based on the 
following smooth approximation of ,g(A): 



g(A;T) = Tlog 52 exp 



L{x'-X) 



(19) 



As illustrated if Fig. HI the parameter T > controls the 
trade-off between smoothness of g(A;T) and how well it 
approximates g{X). This is known as the "log-sum-exp" 
approximation to the "max function" [14]: 



g{X) <g{l;i) <g(A) + Tlog|X| for all t > 0. 



(20) 



Hence, g{X;T) §(A) uniformly as T ^ and, hence, 
g*{x) = mi\\xg{X\T) converges to g*. 

The function g(A;T) has another useful interpretation. 
Consider the Gibbs distribution defined by 



M,t(-^') =exp 



L(x',A)-g(A;T) 



(21) 



Here, T > is the "temperature" and §(A;t) normalizes the 
distribution for each choice of A and T, and is equal to the 
Helmholtz free energy ^h{0') = T<I>t(0')' where 4>t(0') = 
logEexp(T^' (0', ^'(jic'))) is the usual log-partition function. 
Thus, ^(A;t) is a strictly convex, analytic function. Using 
the moment-generating property of 4>t(0'), the gradient of 



ALGORITHM 1 (Discrete LR) 

Iterate until convergence: 
For £■ e where > 1 

For E' e M{E) 

ft,E'{x'E') = Tlog/?^ji(x^,) 

end 

/t,£ {xe ) = ^ Le' ft.E' {XE ) 

For E' e £g{E) 

/e' {xe ) ^ /e' (xe ) + (/t,£' (.Xfi ) - fr.E' {xE ) ) 

end 
end 



§(A;t) is computed as: 

= Px.x[<^a]- Px,x[M (22) 

where we use p\-] to denote expectation under p. Thus, 
appealing to strict convexity, there is a unique X*{x) that 
minimizes g{X;T) and it is also the unique solution of the 
set of moment-matching conditions: 

PxA^a] = PxAMi for all A=B. 

These moment-matching conditions are equivalent to requir- 
ing that the marginal distributions Px.x{xa) and px^xb) 
are equal for xa — xb- We also note that '^^^^^'^'^ ~ 
PA.Ti^logPx.r]' which is the entropy of px^x and is positive 
for all X. Hence, for a decreasing sequence T^; > converging 
to zero, g{X;T) converges monotonically to g{X). Likewise, 
g*{T:k) converges monotonically to g* . 

Rather than directly optimizing g{X), we instead perform 
a sequence of minimizations with respect to the functions 
g{X;Xi^). At each step, the previous estimate of X^ = 
argming(A; Tj^) is used to initialize an iterative method to 
minimize g(A;Tjt+i). This is illustrated in Fig. ID At each 
step, we use the following optimization procedure based on 
the marginal agreement condition. 

1) Iterative Log-Marginal Averaging: To minimize 
g(X\x) for a specified T, starting from an initial guess for 
X (or, equivalently, an initial splitting of /), we develop 
a block coordinate-descent method. Our approach is in the 
same spirit as the iterative proportional fitting procedure [15]. 

We begin with the case that the augmented model is 
defined so that no two replicas of a node are contained 
in the same connected component of . Then, at each 
step, we minimize over the set of all Lagrange multipliers 
associated with features defined within any replica of E. This 
is equivalent to solving the condition that the corresponding 
marginal distributions px^x'^,) are consistent for all E' G 
1%{E). Algorithm 1 summarizes the method, which involves 
computing the log-marginal of each replica edge, and then 
updates the functions according to the rule: 

fs' (xe ) ^ Ie' {xe ) + {f%,E {xE ) - f%,E' {xE ) ) (23) 



where 

fT,E'{x'E') = T:iogpx,r{x'E>), fT,E{xE) = rE^ ^ fx,E'{xE)- 

E'e^{E) 

After the update, the new log-marginals of all replicas E' 
are equal to fr^E- Also, these updates maintain a consistent 
representation: 'E,E'{fT.E — ft.E') = 0- To handle augmented 
models with multiple replicas of E in the same connected 
subgraph, we only update a subset of replicas at each step, 
where no two replicas are in the same subgraph. In some 
cases, this requires including an extra replica of E to act as 
an intermediary in the update step. 

Each step of the procedure requires that we compute the 
marginal distributions of each replica £' in their respective 
subgraphs. In the graphs are thin, these marginals can be 
computed efficiently, with computation linear in the size of 
each subgraph, using standard belief propagations methods 
and their junction tree variants. Moreover, if we take some 
care to store the messages computed by belief propagation, 
it is possible to amortize the cost of this inference, by only 
updating a few "messages" at each step. In fact, it is only 
necessary to update those messages along the directed path 
from the last updated node or edge to the location in the 
tree (or junction tree) of the node or edge currently being 
updated. We find that this generally allows a complete set of 
updates to be computed with complexity linear in n. Similar 
ideas are discussed in [16]. 

Using Algorithm 1, together with a rule to gradually 
reduce T, we obtain a simple algorithm which generates a 
sequence Ajt such that g{Xii) converges to g* and A^. converge 
to a point in the set of optimal Lagrange multipliers. 

2) Iterative Max-Marginal Averaging: We now consider 
what happens as T approaches zero. The main insight is 
that the (non-normalized) log-marginals converge to max- 
marginals in the limit as T approaches zero: 

f\E'{x'E')+g{^,^) fE'{x'E') ^raeixf{x'^,,x'r,X) (24) 

Hence, as T becomes small, the marginal agreement con- 
ditions are similar to a set of max-marginal agreement 
conditions among all replicas of an edge or node. One could 
consider a "zero-temperature" version of Algorithm 1 aimed 
at solving these max-marginal conditions directly: 

fk' {xe) ^ /e' {xe) + {Ie (xe) - Ie' (xe)) 
fE{xE) = rE^Y^fE'{xE) (25) 

Here, is the averaged max-marginal over all replicas of 
E. Note that fE'{xE) > fE{xE) — niaX;c,^^ /(x) for all xe and 
E' G ^{E), which implies fE{xE) > fE{xE)- This "zero- 
temperature" approach has close ties to max-sum diffusion 
(see [13] and reference therein) and Kolmogorov's serial 
approach to TRMP [16]. 

In our framework, one can show that X* = limT^o A* (t) is 
well-defined and minimizes g{X). This point X* also satisfies 
the max-marginal agreement condition and is therefore a 



fixed point of max-marginal averaging. However, the max- 
marginal agreement condition by itself does not uniquely 
determine A* and, in fact, is not sufficient to insure that 
g{X) is minimized (this is related to the existence of non- 
minimal fixed-points observed by Kolmogorov). Hence, our 
approach to minimize ^(A;t) while gradually reducing the 
temperature has the advantage that it cannot get stuck in such 
spurious fixed-points. It also helps to accelerate convergence, 
because the initial optimization at higher temperatures serves 
to smooth over irregularities of the dual function. 

Computational Examples. In this section we provide some 
preliminary results using our approach to solve binary MRFs. 
These examples are for a binary model x,, e { — 1,+1} 
defined on a 10 x 10 grid similar to the one seen in Fig.|2a). 
For each node, we include a node potential fv{xy) — O^x^ with 
eyr^N{0,a^). For each edge, we include an edge potential 
fu,vixu,Xv) = duvXuXv with 0„„ = 1 in the "attractive" model 
and random 0„,, = ±1 in the "frustrated" model. Hence, 
(7 controls the strength of node potentials relative to edge 
potentials. As seen in Fig.|5] we obtain strong duality g* —f* 
and recover the correct MAP estimates in attractive models. 
This is consistent with a result on optimality of TRMP in 
attractive models [10]. In the frustrated model, the same 
holds with strong node potentials, but as a is decreased 
the frustration of the edge potentials cause a duality gap. 
However, even in these cases, we have observed that some 
nodes have a unique maximum in their re-summed max- 
marginals, and these nodes provide a partial MAP estimate 
that agrees with the correct global MAP estimate. This is 
apparently related to the weak tree agreement condition for 
partial optimality in TRMP [10]. 

IV. Gaussian Lagrangian Relaxation 

In this section we apply the LR approach to the problem 
of MAP estimation in Gaussian graphical models, which is 
equivalent to maximizing a quadratic objective function 

f[x\h,J) = + (26) 

where 7 ;^ is sparse with respect to Again, we con- 
struct an augmented model, which is now specified by an 
information form {h' ,J'), defined by a larger graph For 
consistency, we also require f'{tl,{x);h'.J') — f{x;h,J) for 
all X. Denoting variable replication by ^{x) — Ax, this is 
equivalent to A^J'A = J and A^h' — h. In order for the dual 
function to be well-defined, we also require that J' >~ 0. For 
general 7 ;^ 0, it is possible that, for a given augmented graph 
, there do not exist any J' >~ Q defined on such that 
A^ J'A = J. To avoid this issue, we will focus on models that 
are of the form: 

f{x) = fEi^E) (27) 

where ^ is a hyper-graph, composed of cliques of 
and each term fsixE) is itself a quadratic form fEixs) = 
~ ^x]^Jexe + Ii^xe based on Je >- 0. Then, J = Y.e[Je]v is the 
sum of these (zero-padded) submatrices. Then, it is simple to 



obtain a valid augmented model. We split each Je between 
its replicas as J^i = r^^^JE to obtain J' ~ Y.E'e^^'[JE'\v' ^ 0- 
If there exists a representation of / in terms of 2 x 2 
pairwise interactions Je >- 0, it is said to be pairwise normal- 
izable. This condition is equivalent to the walk-summability 
condition considered in [17], which is related to the conver- 
gence (and correctness) of a variety of approximate inference 
methods [17], [18]. Here, we show that for the more general 
class of models of the form ( l27l ). we obtain a convergent 
iterative method for solving the dual problem that is tractable 
provided the cliques are not too large. Moreover, for this 
class of Gaussian models, we show that there is no duality 
gap and we always converge to the unique MAP estimate 
of the model. As an additional bonus, we also find that, 
by solving marginal agreement conditions in the augmented 
Gaussian model, we obtain a set of upper-bounds on the 
variances of each variable, although these bounds are often 
rather loose. 

A. Gaussian LR with Linear Constraints 

We begin by considering the Lagrangian dual of the 
following linearly-constrained quadratic program: 

maximize —^x'^J'x' + h'^x' 
subject to x[i — x'l^ for all a = b. 

We may express the linear constraints on x' as Hx' = 0. Re- 
laxing these constraints leads to the following dual function: 

g{X) = max{-ix'Vx' + (/i' + //^A)x'} 

x' 

= \{h' + H'^Xfj'-\h' +h'^X) (29) 

Moreover, by strong duality of quadratic programming [7], 
it holds that g* = f*. We also note the following equivalent 
representation of the dual problem: 

^ f minimize jh'^J'^^h' 
^ ~{ subject to A^h'^h ^ ' 

Here, h' is the problem variable, and we consider all possible 
choices of h' that are consistent with h under the constraint 
x' ~ Ax. The optimal choice of h' in this problem is the one 
which leads to consistency in the estimate x' —J'^^h'. 

B. Quadratic Constraints and Log-Det Regularization 

Although, in Gaussian models, it is sufficient to include 
only linear constraints (there is no duality gap), our method 
can also accommodate quadratic constraints, and this results 
in faster convergence and tighter bounds on variances. Con- 
sider the constrained optimization problem: 

maximize — ^xf^ J'x' + h'^x' 

subject to Xa = xy^x^j = xl for all a = b, 

XaiXa2 =XbiXh^ for all (fli,a2) = {b\M)- 

(31) 

This leads to the following equivalent version of the dual 
problem with problem variables {h' ,J'): 

minimize jh'^J'^^h' 

subject to A^h' = h, A^J'A ^J,J'y 0. ' 
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Fig. 5. Five examples for discrete LR showing: (top row) convergence of to g* compared to /* (horizontal line); (bottom row) the resulting estimates 
generated by relaxed max-marginals (grey ai'eas denote non-unique maximum). The first two columns are examples of attractive models with (7 — 2 and 1. 
The last three columns are frustrated models with O" — 1.5, 1, and .7. 



ALGORITHM 2 (Gaussian LR) 

Iterate until convergence: 
For E E'^ where > 1 

For E' e Sg{E) 

Compute moments (i^/jP^/) in {h',J') 

Je' — /"^E' 
end 



JE = rE^ Le' Je' , hE = ' Y.E' hh 



For E' e M{E) 



■'e',e' 
h'E' 



J pi 



E'M' 



{Je-Je') 



end 
end 



Any solution of the linearly-constrained relaxation provides 
a feasible point for this problem, so the value of (|32] | is less 
than or equal to that of dSOl l. However, since there is no 
duality gap in ( l30l l. the value of the two problem are equal, 
both achieve g* — f* and obtain the MAP estimate. 

While the choice of J' does not affect the value of the dual 
problem, it does effect variance estimates and convergence of 
iterative methods. Hence, we regularize the choice of J' by 
adding a penalty — ^logdet/' to the objective of ( [32] ). which 
also serves as a barrier function enforcing J' >- 0. The result- 
ing objective function is then equivalent to <t>(/i,y), which 
shows a parallel to our earlier approach for "smoothing" the 
dual function in discrete problems. 

C. Gaussian Moment-Matching 

We develop an approach in the same spirit as the Gaussian 
iterative scaling method [6]. We minimize the log-partition 
function with respect to the information parameters over 
all replicas of a node or edge, subject to consistency and 
positive definite constraints. The optimality condition for 
this minimization is that the marginal moments (means and 
variances) of all replicas are equalized. It can be shown 
that the following information-form updates achieve this 
objective. First, for all replicas E' of E, we compute the 
marginal information parameters given by sparse Gaussian 



elimination of C = V \£" in {f ,h'): 

Je' = J'e'.e' ^J'E'.ciJc.c) ^J'cE' 
hE' = h'j,,~j'j,,(,{Jccy^h'c 



(33) 



This is equivalent to 7^/ = P^i and h^i ^P^, x^i. Next, we 
average these marginal information forms over all replicas: 



Je 



Y^Je', hi 



. -1 



(34) 



E' E' 

Finally, we update the information form according to: 



J 



E'.E' 
h'E' 



J'e'.e' + {Je —Je') 
h'E' + [hE-hE') 



(35) 



Using the characterization of positive-definiteness of a block 
matrix in terms of a principle submatrix and its Schur 
complement, it can be shown that this update preserves 
positive definiteness of J'. It also preserves consistency, 
e.g., Y.e'{Je ^ Je') = 0- After the update, the new marginal 
information parameters for all replicas of E are equal to 
(IieJe)- Algorithm 2 summarizes this iterative approach for 
solving the Gaussian LR problem. 

Lastly, using the fact that /^(xs) > fE{xE) for all xe and 
that there is no duality gap upon convergence, we conclude 
that the final equalized marginal information must satisfy 
Je d^JE = Je.e ~ JE.\E{J\E.\Ey^J\E.,E- Hence, LR gives an 
upper-bound on the true variance: Pe — (Je)^^ d: (-/s) 
If each replica of E is contained in a separate connected 
component of then a tighter bound holds: Pe d {fE-fE)^^ ■ 

Computational Examples. We apply LR for two Gaussian 
models defined on a 50 x 50 2D grid with correlation lengths 
comparable to the size of the field. First, we use the thin 
membrane model, which encourages neighboring nodes to 
be similar by having potentials fij = (x,- ~XjY for each edge 
G We split the 2D model into vertical strips of 
narrow width K, which have overlap L (we vary K and set 
L = 2). We impose marginal agreement conditions m K x L 
blocks in these overlaps. The updates are done consecutively, 
from top to bottom blocks, from the left to the right strip. 
A full update of all the blocks constitutes one iteration. We 
compare LR to loopy beUef propagation (LBP). The LBP 
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Fig. 6. Convergence plots for variances (left) and means (right), in the 
thin-membrane model (top) and thin-plate model (bottom). 



variances are underestimates by 21.5 percent (averaged over 
all nodes), while LR variances for K are overestimates 
by 16.1 percent. In Figure |6] (top) we show convergence of 
LR for several values of K, and compare it to LBP. The 
convergence of variances is similar to LBP, while for the 
means LR converges considerably faster. In addition, the 
means in LR converge faster than using block Gauss-Seidel 
on the same set of overlapping K x5Q vertical strips. 

Next, we use the thin plate model, which enforces that 
each node v is close to the average of its nearest neighbors 
N{v) in the grid, and penalizes curvature. At each node there 
is a potential: /;(x,-,X;v(i)) = (^'' - ]W{I)\'Li<^N{i)Xif ■ LBP does 
not converge for this model. LR gives rather loose variance 
bounds for this more difficult model: for K ~ 12, it overes- 
timates the variances by 75.4 percent. More importantly, it 
accelerate convergence of the means. In Figure |6] (bottom) we 
show convergence plots for means and variances, for several 
values of K. As K increases, the agreement is achieved 
faster, and for K = \2 agreement is achieved in under 13 
iterations for both means and variances. We note that LR 
with K = 4 converges much faster for the means than block 
Gauss-Seidel. 

V. Multi-Scale Lagrangian Relaxation 

In this section, we propose an extension of the LR method 
considered thus far. Previously, we have considered relax- 
ations based on augmented models where x' = ^{x) involves 
replication of variables. Here, we consider more general 
definition of ^ to allow the augmented model to include 
summary variables, such as a sum over a subset of variables, 
or any linear combination of these. In discrete models, 
summary variables can also be non-linear functions of x. 
For example, "parity bits" are used in coding applications 
and the "majority rule" is used to define coarse-scale binary 
variables in the renormalization group approach [19]. 

Using this idea, we develop a multiscale Lagrangian 
relaxation approach for MRFs defined on grids. The purpose 
of this relaxation is similar to that of the multigrid and 
renormalization group methods [20], [19]. Iterative methods 
generally involve simple rules that propagate information 
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Fig. 7. Illustration of multiscale LR method, (a) First, we define an 
equivalent multiscale model subject to cross-scale constraints. Relaxing 
these constraints leads to a set of single-scale models, (b) Next, each single 
scale is relaxed to a set of tractable subgraphs. 

locally within the graph. Using a multiscale representation 
of the model allows information to propagate through coarse 
scales, which improves the rate of convergence to global 
equilibrium. Also, in discrete problems, such multiscale 
representations can help to avoid local minima. In the context 
of our convex LR approach, we expect this to translate into 
a reduction of the duality gap to obtain the optimal MAP 
estimate in a larger class of problems. 

A. An Equivalent Multiscale Model 

We illustrate the general idea with a simple example based 
on a ID Markov chain. While this case is actually tractable 
by exact methods, it serves to illustrate our approach, which 
generalizes to 2D grids and 3D lattices. In Fig. |7] we show 
how to construct the augmented model f'{x') defined on a 
graph This is done in two stages. 

First, as illustrated in Fig. [Tfa), we introduce coarse-scale 
representations of the fine scale variables by recursively 
defining summary variables at coarser scales to be functions 
of variables at the next level down. This defines a set of cross- 
scale constraints, denoted by the square nodes. To allow 
interactions between coarse-scale variables, while maintain- 
ing consistency with the original single-scale model, we 
introduce extra edges (the dotted ones in Fig. |7|a)) between 
blocks of nodes that have a (solid) edge between their 
summary nodes at the next coarser scale. This representation 
allows us to define a family of constrained multiscale models 
that are all equivalent to the original single-scale model. 
For 2D and 3D lattices, this model is still intractable even 
after relaxing the cross-scale constraints because each scale 
is itself intractable. 

Next, to obtain a tractable dual problem, we break up the 
graph into smaller subgraphs, introducing additional con- 
straints to enforce consistency among replicated variables. 
In the example, we break the augmented graph at each scale 
into its maximal cliques, shown in Fig.|7jb). This defines the 
final augmented model and the corresponding graph. In a 2D 
graph, the same idea applies, but we obtain a set of maximal 
cliques consisting of overlapping 2x4 and 4x2 blocks of 



the grid. Alternatively, we could break up the 2D grid into 
a set of width 2 vertical strips, as discussed previously. 

Now, the procedure is essentially the same as before. We 
start with the equivalent constrained optimization problem 
defined on the augmented graph, now subject to both in-scale 
and cross-scale constraints. We obtain a tractable problem 
by introducing Lagrange multipliers to relax these con- 
straints. Then we iteratively adjust the Lagrange multipliers 
to minimize the dual function, with the aim of eliminating 
constraint violations to obtain the desired MAP estimate. 
This is equivalent to adjusting the augmented model f'{x') 
on subject to the constraint that it remains equivalent to 
/(x) for all x' = Ax. 

B. Gaussian Multiscale Moment-Matching 

We demonstrate this approach in the Gaussian model. To 
carry out the minimization, we again use a block coordinate- 
descent method that finds an exact minimum over a subset of 
Lagrange multipliers at each step. The replica constraints are 
handled the same as before. Here, we briefly summarize our 
approach to handle the cross-scale summary constraints. Let 
x\ and X2 denote two random vectors at consecutive scales 
coupled by the constraint X2 =Axi. Let {h\,J\) and (/i2,-/2) 
denote their corresponding marginal information parameters. 
Relaxing the constraints X2 = Axi and X2X2 = AxixjA^ , 
with Lagrange multipliers (A, — jA), leads to the following 
optimality conditions: 

(/2+A)-i = A{J\-A^AA)-^A^ (36) 

{f2+A)-\h2 + X) = A(/i -A^AA)-'(/'ii -A^A) 

We find that the solution isQ 

A = {{{Af^'A^r' 



-Ji} 

= i{(A/fU^)-U/f'/ii 



112} 



(37) 



'1 ^ ; ^■'1 

The model {h',J') is then updated by adding (A, A) to 
the coarse-scale and subtracting (A^A,A^AA) from the fine 
scale. This update enforces the moment conditions X2 — 
Ail and P2 = APiA^ while maintaining consistency of the 
model [h' ,J'). Similar updates can be derived when there 
are multiple replicas of x\ and xi- These methods, together 
with those described previously, are used to minimize the 
dual function in the Gaussian multiscale relaxation. 

Multiscale Example. We provide a preliminary result in- 
volving a ID thin-membrane model with 1024 nodes. It is 
defined to have a long correlation length comparable to the 
length of the field. Using a random /i-vector, we solve for 
the MAP estimates using three methods: a standard block 
Gauss-Seidel iteration using overlapping blocks of size 4; 
the (single-scale) Gaussian LR method with the same choice 
of blocks; and the multiscale LR method. The convergence 
of all three methods are shown in Fig. [8] We see that the 
single-scale LR approach is moderately faster than block 
Gauss-Seidel, but introducing coarser-scales into the method 
leads to a significant speed-up in the rate of convergence. 

'^The formula (37) corresponds to a generalization of Algorithm 2, in 
which the moments {x\.Pi) of fine-scale variables x\ are replaced by the 
corresponding moments {Axi,APiA^ ) of the summary statistic x\ =Axi. 
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Fig. 8. Convergence of single- and multi-scale LR and block Gauss-Seidel. 

VI. Discussion 

We have introduced a general Lagrangian relaxation 
framework for MAP estimation in both discrete and Gaussian 
graphical models. This provides a new interpretation of 
some existing methods, provides deeper insights into those 
methods, and leads to new generalizations, such as the mul- 
tiscale relaxation introduced here. There are many promising 
directions for further work. While we have considered dis- 
crete and Gaussian models separately, the basic approach 
should extend to the richer class of conditionally Gaussian 
models [1] including both discrete and continuous variables. 
In discrete models, designing augmented models that capture 
more structure of the original problem leads to reduced 
duality gaps and optimal MAP estimates in larger classes 
of models. It would be of great interest to finds ways to 
adaptively search this hierarchy of relaxations to efficiently 
reduce and eventually eliminate the duality gap with minimal 
computation. It is also of interest to consider approaches 
to identity provably near-optimal estimates, perhaps using 
the relaxed max-marginal estimates, in cases where it is not 
tractable to completely eliminate the duality gap. 
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